# Implementation Plan: Cloud Snapshot Demo Lifecycle

**Branch**: 008-cloud-snapshot-lifecycle | **Date**: 2026-02-27 | **Spec**: spec.md
**Input**: Feature specification from /specs/008-cloud-snapshot-lifecycle/spec.md
## Summary
Add snapshot-based lifecycle management to the Hetzner Cloud demo infrastructure. Instead of provisioning from scratch every time (~25 min), users snapshot a working cluster once and restore from snapshots in under 5 minutes. Four new Bash scripts (`demo-cloud-snapshot.sh`, `demo-cloud-warm.sh`, `demo-cloud-cool.sh`, `demo-cloud-health.sh`) extend the existing cloud infrastructure tooling. Snapshot restore bypasses Terraform, using the hcloud CLI directly with label-based resource tracking. A post-restore Ansible playbook handles hostname fixup after cloud-init. A standalone health check with auto-remediation verifies all services before demos.
## Technical Context
**Language/Version**: Bash (POSIX-compatible with Bash extensions, matching the existing scripts), Ansible 2.16+ (post-restore playbook), Python 3.9+ (inline date-parsing helper)
**Primary Dependencies**: hcloud CLI 1.42+, Terraform 1.7+ (cold build only), Ansible 2.16+, jq (JSON parsing), openssh-client
**Storage**: Local JSON manifest file (`infra/terraform/snapshot-manifest.json`); Hetzner Cloud snapshot storage (remote)
**Testing**: Manual end-to-end testing against a live Hetzner Cloud environment (no unit-test framework for Bash scripts; this follows the existing project pattern)
**Target Platform**: macOS / Linux (developer machines), Docker container (`rcd-demo-infra:latest`)
**Project Type**: Infrastructure scripts extending the existing IaC project
**Performance Goals**: Warm start < 5 min (SC-001), health check < 60 sec (SC-003), snapshot creation < 10 min (SC-002)
**Constraints**: Must work inside the Docker container AND natively; must not modify existing demo playbooks or scenarios; must follow existing script patterns (`set -euo pipefail`, exit codes, output formatting)
**Scale/Scope**: 4 new scripts (~200-400 lines each), 1 Ansible playbook (~30 lines), 4 Makefile targets, 1 JSON manifest file
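The "existing script patterns" constraint can be illustrated with a minimal skeleton. The strict-mode line comes straight from the constraints above; the `info`/`warn`/`error` helper names echo the message style noted in the Constitution Check but are an assumption about the existing scripts, not copied from the repository.

```shell
#!/usr/bin/env bash
# Sketch of the shared script skeleton: strict mode, consistent output
# helpers, and explicit exit codes. Helper names are assumed, not from
# the repo.
set -euo pipefail

info()  { printf '[INFO]  %s\n' "$*"; }
warn()  { printf '[WARN]  %s\n' "$*" >&2; }
error() { printf '[ERROR] %s\n' "$*" >&2; exit 1; }

main() {
  # Fail fast on missing dependencies before doing any cloud work.
  command -v bash >/dev/null 2>&1 || error "bash is required"
  info "dependencies OK"
}

main "$@"
```

Each new script would follow this shape, so exit codes and message formatting stay uniform across `demo-cloud-*.sh`.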
## Constitution Check

**GATE**: Must pass before Phase 0 research. Re-check after Phase 1 design.
| Principle | Status | Notes |
|---|---|---|
| I. Plain Language First | PASS | Scripts use clear info/warn/error messages; health check outputs human-readable table |
| II. Data Model as Source of Truth | PASS | Snapshot manifest is single source for set metadata; cloud labels provide API-based discovery |
| III. Compliance as Code | PASS | Feature extends infrastructure tooling, not compliance controls; existing roles unchanged |
| IV. HPC-Aware | N/A | No HPC-specific considerations for snapshot lifecycle |
| V. Multi-Framework | N/A | No compliance framework interactions |
| VI. Audience-Aware Documentation | PASS | Quickstart.md provides user-facing guide; contracts define technical interface |
| VII. Idempotent and Auditable | PASS | Health check is read-only and idempotent; snapshot/restore are one-shot operations with clear state transitions |
| VIII. Prefer Established Tools | PASS | Uses hcloud CLI (official Hetzner tool), Ansible (existing stack), jq (standard JSON tool); no custom tooling reinvented |
**Gate result**: PASS. All applicable principles satisfied.
## Project Structure

### Documentation (this feature)
```
specs/008-cloud-snapshot-lifecycle/
├── plan.md              # This file
├── spec.md              # Feature specification
├── research.md          # Phase 0: Technical research and decisions
├── data-model.md        # Phase 1: Data model (manifest schema, health report)
├── quickstart.md        # Phase 1: User-facing quick start guide
├── contracts/
│   └── cli-interface.md # Phase 1: CLI command contracts
├── checklists/
│   └── requirements.md  # Spec quality checklist
└── tasks.md             # Phase 2: Implementation tasks (via /speckit.tasks)
```
### Source Code (repository root)
```
infra/scripts/
├── demo-cloud-snapshot.sh   # NEW: Create/list/delete snapshot sets
├── demo-cloud-warm.sh       # NEW: Restore cluster from snapshots
├── demo-cloud-cool.sh       # NEW: Graceful session wind-down
├── demo-cloud-health.sh     # NEW: Service health check with auto-remediation
├── demo-cloud-up.sh         # MODIFIED: Add snapshot prompt after successful provisioning
├── demo-cloud-down.sh       # UNCHANGED
├── check-ttl.sh             # UNCHANGED (already supports hcloud label queries)
└── docker-run.sh            # UNCHANGED

infra/terraform/
├── snapshot-manifest.json   # NEW: Local snapshot set metadata (gitignored)
├── inventory.yml            # EXISTING: Generated by warm-start (same format)
└── *.tf                     # UNCHANGED

demo/playbooks/
├── post-restore.yml         # NEW: Hostname fixup after snapshot restore
├── provision.yml            # UNCHANGED
└── scenario-*.yml           # UNCHANGED

Makefile                     # MODIFIED: Add demo-warm, demo-cool, demo-snapshot, demo-health targets
.gitignore                   # MODIFIED: Add snapshot-manifest.json
```
**Structure Decision**: Extends the existing `infra/scripts/` directory with four new scripts following the established `demo-cloud-*.sh` naming pattern. One new Ansible playbook in `demo/playbooks/` handles post-restore hostname fixup. No new directories are created; everything fits into the existing project structure.
## Key Technical Decisions

See `research.md` for the full rationale on each decision.
- **Bypass Terraform for restore**: Use the hcloud CLI directly. Snapshot-restored clusters are tracked via cloud labels and the local manifest, not Terraform state. This avoids state conflicts and simplifies the restore workflow.
- **Post-restore hostname fixup**: Cloud-init overwrites FQDN hostnames on boot. A minimal Ansible playbook (`post-restore.yml`) restores `*.demo.lab` FQDNs and restarts affected services (~30 seconds).
- **Label-based resource discovery**: All snapshot-restored resources are labeled with `cluster=rcd-demo` and `snapshot-set=<label>`. This enables teardown via `hcloud` selectors and keeps compatibility with the existing `check-ttl.sh`.
- **Service stop before snapshot**: Critical services (FreeIPA, Slurm, Wazuh, Munge) are stopped before snapshotting to protect database consistency. Services restart immediately after snapshot creation completes.
- **Health check with single-retry remediation**: On service failure, attempt one `systemctl restart`, wait 5 seconds, and re-check. Report the final status. This handles transient post-boot service-ordering issues without masking deeper problems.
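The label-based discovery decision can be sketched in Bash. The label keys match the plan (`cluster=rcd-demo`, `snapshot-set=<label>`); the `build_selector` and `teardown_cmds` function names are hypothetical, and the hcloud commands are printed rather than executed so the control flow is visible without touching a live project.

```shell
# Sketch: discover and tear down a snapshot-restored cluster by label.
# Function names are assumed helpers, not taken from the repo.
build_selector() {
  # hcloud label selectors use key=value pairs joined by commas.
  printf 'cluster=rcd-demo,snapshot-set=%s' "$1"
}

teardown_cmds() {
  local sel
  sel="$(build_selector "$1")"
  # hcloud supports --selector on list commands; deletion then targets
  # the discovered server names. Echoed here instead of executed.
  echo "hcloud server list --selector ${sel} -o columns=name"
  echo "hcloud server delete <names-from-list>"
}
```

Because `check-ttl.sh` already queries by hcloud labels, anything restored with these labels is picked up by the existing TTL tooling for free.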
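The single-retry remediation logic can be sketched as a generic Bash function. Here the `systemctl` calls are replaced by caller-supplied check and restart commands so the retry shape is testable; in the real script the check would be along the lines of `systemctl is-active <svc>` and the remediation `systemctl restart <svc>`. The function name and `RETRY_WAIT` variable are assumptions for illustration.

```shell
# Sketch of the health check's single-retry remediation.
# One remediation attempt only: no loop, so persistent failures surface.
RETRY_WAIT="${RETRY_WAIT:-5}"

check_with_retry() {
  local check_cmd="$1" restart_cmd="$2"
  if eval "$check_cmd"; then
    echo "OK"
    return 0
  fi
  eval "$restart_cmd"   # single remediation attempt
  sleep "$RETRY_WAIT"   # give the service time to come up
  if eval "$check_cmd"; then
    echo "OK (after restart)"
    return 0
  fi
  echo "FAILED"
  return 1
}
```

Capping remediation at one attempt keeps the health check under the 60-second budget (SC-003) while still absorbing transient boot-ordering failures.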
## Complexity Tracking
No constitution violations to justify. All design choices use established tools (hcloud CLI, Ansible, Bash, jq) and follow existing project patterns.